This work is a collaboration with Chang Xiao. Here are the details of our individual contributions:
Network Design (Chang, Ye)
Data preparation (Chang)
Training (Chang)
Results Evaluation (Ye)
Physically based rendering has been studied for a long time in the computer graphics community. To obtain a photorealistic image, one must solve the rendering equation [1]:
$$ L_o(x, \omega_o, \lambda, t) = L_e(x, \omega_o, \lambda, t) + \int f_r(x, \omega_i, \omega_o, \lambda, t) L_i(x, \omega_i, \lambda, t) (\omega_i \cdot n) \,\mathrm {d} \omega_i $$
This integral can be estimated by Monte Carlo integration. Monte Carlo (MC) rendering systems approximate it by tracing light rays (samples) through the scene to evaluate the scene function. Although an estimate based on just a few samples can be evaluated quickly, its inaccuracy relative to the true value produces unacceptable noise in the resulting image. Since the variance of the MC estimator decreases linearly with the number of samples, many samples are required to obtain a reliable estimate. The high cost of computing additional rays leads to lengthy render times that limit the applicability of MC renderers in modern film production. To tackle this problem, we present a new deep-learning-based framework that can produce an approximation of a high-sample image from a low-sample input. To do this, we train a convolutional neural network, tightly coupled with ResNet [2], on a set of noisy MC-rendered images and their corresponding ground-truth images.
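As a toy illustration of why so many samples are needed (this is not part of our pipeline; the integrand here is just a simple stand-in for the scene function), the variance of a Monte Carlo estimate falls off like 1/N, so its standard error falls off only like 1/sqrt(N):

```python
import numpy as np

# Estimate the integral of f(x) = x^2 over [0, 1] (true value 1/3) by
# averaging random samples, and measure how the estimator's variance
# shrinks as the sample count N grows (it decreases like 1/N).
rng = np.random.default_rng(0)

def mc_estimate(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return np.mean(x ** 2)

for n in [10, 1000, 100000]:
    estimates = [mc_estimate(n) for _ in range(200)]
    print(n, np.var(estimates))
```

Halving the error therefore requires four times as many rays, which is exactly what makes brute-force MC rendering so expensive.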
Our task is to predict high-sample images from low-sample images, which can significantly reduce the time needed to obtain high-quality renders.
We use a convolutional neural network (CNN) to solve this problem.

The first step of this project is preparing the training data. We use Blender [3], the most popular open-source software for 3D modeling and rendering. Using Blender, we rendered over 100 image pairs, each consisting of a high-sample and a low-sample version of the same scene, as our training data. Most of them have a resolution larger than 1280x960.
We adopt VGG16 (without the top layers) as our base model and add several residual network blocks on top of it.

We generate training samples by randomly cropping patches of size (224, 224) from the low-sample images and the corresponding locations in the high-sample images.
Since physically based rendered images share very similar features with natural images, we adopt VGG16 as our base layers. The input of our network is RGB image patches of shape (224, 224, 3). The output of VGG16, which has shape (7, 7, 512), is then fed into several ResNet blocks with upsampling. The final output recovers the shape (224, 224, 3), giving the denoised image patches.
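The shape bookkeeping above can be sanity-checked in a few lines: VGG16 halves the spatial resolution five times (224 down to 7), so five 2x upsampling stages in our decoder recover the input resolution exactly.

```python
# VGG16 applies five 2x max-pooling stages, shrinking 224x224 down to 7x7;
# five UpSampling2D stages in the decoder reverse this exactly.
side = 224
for _ in range(5):
    side //= 2
print(side)   # 7
for _ in range(5):
    side *= 2
print(side)   # 224
```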
We define the final loss as the MSE between the network's output patches and the ground-truth patches.
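In NumPy terms the loss is simply the following (a sketch; Keras's built-in `mean_squared_error` likewise averages the squared differences over all elements):

```python
import numpy as np

# Mean squared error between a predicted patch and its ground-truth patch,
# averaged over all pixels and channels.
def mse(pred, target):
    return np.mean((pred - target) ** 2)

print(mse(np.zeros((224, 224, 3)), np.ones((224, 224, 3))))  # 1.0
```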
We made many attempts before settling on the final model.
At first, we tried a multi-layer perceptron (MLP), inspired by the article "Image denoising with multi-layer perceptrons" [6]. At the same time, we designed a simple fully convolutional network (FCN). After comparing the performance of the MLP and the FCN, we decided to focus on CNN models.
We then designed a deep fully convolutional model. Strangely, the colors of its output were somewhat duller than the ground-truth images. We found that this was because the patch size was too small (28, 28); when we increased the patch size, the output colors became close to the ground truth.
To improve performance, we adopted the pre-trained VGG16 as our base model, excluding the top dense layers, and changed the patch size to (224, 224), the default VGG16 input size. We went further by adding several residual network blocks on top of VGG16 to reach a lower loss. This is our final model.
First, we construct the network structure.
%matplotlib inline
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras.models import Model
from keras.layers import Input
from keras.layers.convolutional import Conv2D, UpSampling2D
from keras.callbacks import ModelCheckpoint
def create_network():
    # First we load the pre-trained VGG16 network (without the top dense layers).
    from keras.applications.vgg16 import VGG16
    input_tensor = Input(shape=(224, 224, 3))
    base_model = VGG16(input_tensor=input_tensor, weights='imagenet', include_top=False)
    for layer in base_model.layers:
        layer.trainable = False
    x = base_model.output

    # The output of VGG16 is fed into several residual blocks, each followed
    # by 2x upsampling.  A 1x1 convolution projects the skip connection to the
    # block's channel count so that the two branches can be added.
    def res_block_up(x, filters):
        shortcut = Conv2D(filters, (1, 1), padding='same')(x)
        y = Conv2D(filters, (3, 3), activation='relu', padding='same')(x)
        x = keras.layers.add([shortcut, y])
        return UpSampling2D(size=(2, 2))(x)

    for filters in [256, 128, 64, 32, 32]:
        x = res_block_up(x, filters)

    # Concatenate with the input so that fine details of the low-sample image
    # can be recovered directly.
    x = keras.layers.concatenate([x, input_tensor])
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    out = Conv2D(3, (3, 3), activation='sigmoid', padding='same')(x)

    model = Model(inputs=base_model.input, outputs=out)
    # The final model uses MSE loss with the Adadelta optimizer.
    model.compile(optimizer='adadelta', loss='mean_squared_error')
    return model
m_model = create_network()
m_model.summary()
Then we generate 10,000 training patches.
import cv2
def random_generate_from_data(sample_size, patch_size):
    low_imgs = []
    high_imgs = []
    low_imgs.append(cv2.imread('../raw_data/low/blenderman.png'))
    high_imgs.append(cv2.imread('../raw_data/high/blenderman.png'))
    low_imgs.append(cv2.imread('../raw_data/low/classroom_low.png'))
    high_imgs.append(cv2.imread('../raw_data/high/classroom_high.png'))
    low_imgs.append(cv2.imread('../raw_data/low/pa_low.png'))
    high_imgs.append(cv2.imread('../raw_data/high/pa_high.png'))
    lows = []
    highs = []
    for i in range(sample_size):
        # Pick a random image pair, then a random patch location within it.
        num = np.random.randint(len(low_imgs))
        low = low_imgs[num]
        high = high_imgs[num]
        x_max = low.shape[0] - patch_size
        y_max = low.shape[1] - patch_size
        x = np.random.randint(x_max)
        y = np.random.randint(y_max)
        low_sample = low[x:x+patch_size, y:y+patch_size, :]
        high_sample = high[x:x+patch_size, y:y+patch_size, :]
        lows.append(low_sample)
        highs.append(high_sample)
    return np.array(lows), np.array(highs)
lowsampleimgs, highsampleimgs = random_generate_from_data(10000, 224)
lowsampleimgs = lowsampleimgs.astype('float32') / 255.
highsampleimgs = highsampleimgs.astype('float32') / 255.
print(np.max(highsampleimgs))
print("train low sample img shape: ", lowsampleimgs.shape)
print("train high sample img shape: ", highsampleimgs.shape)
Here we show some low-sample inputs and the corresponding ground truth.
for i in range(5):
    plt.figure(i)
    plt.grid(b=False)
    plt.subplot(221)
    plt.imshow(lowsampleimgs[i, :, :, :])
    plt.subplot(222)
    plt.imshow(highsampleimgs[i, :, :, :])
    plt.show()
Now we start to train our network.
checkpointer = ModelCheckpoint(filepath="./Models/model_x.hdf5", verbose=0)
m_model.fit(lowsampleimgs, highsampleimgs,
            epochs=100,
            batch_size=30,
            shuffle=True,
            callbacks=[checkpointer])
Then we unfreeze the VGG layers and train for another 10 epochs.
# Unfreeze and train.  Recompiling is required for the change of
# `trainable` to take effect.
for layer in m_model.layers:
    layer.trainable = True
m_model.compile(optimizer='adadelta', loss='mean_squared_error')
m_model.fit(lowsampleimgs, highsampleimgs,
            epochs=10,
            batch_size=30,
            shuffle=True,
            callbacks=[checkpointer])
We reached a final loss of 0.4082.
First, we evaluate our results on small patches.
from keras.models import load_model
m_model = load_model('dl/GAN/Models/model_x.hdf5')
result = m_model.predict(lowsampleimgs[:10])
for i in range(10):
    plt.figure(i)
    plt.subplot(131)
    plt.imshow(lowsampleimgs[i])
    plt.subplot(132)
    plt.imshow(highsampleimgs[i])
    plt.subplot(133)
    plt.imshow(result[i])
    plt.show()
From left to right: the low-sample input image, the high-sample ground truth, and our output prediction.
Our output has much less noise than the input and is very close to the ground truth.
Next we show results for a whole image. We split a high-resolution, low-sample image into patches of size (224, 224, 3), send each patch through the network, and stitch the outputs back together to obtain the final image.
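A minimal sketch of this split-and-stitch step (assuming, for simplicity, non-overlapping tiles and an image whose sides are multiples of 224; `denoise_full_image` is an illustrative helper, not code from our notebook):

```python
import numpy as np

def denoise_full_image(model, img, patch=224):
    # Tile the image into non-overlapping (patch, patch) blocks, run each
    # block through the network, and stitch the outputs back together.
    # Border pixels that do not fill a whole patch are dropped in this sketch.
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    out = np.zeros((rows * patch, cols * patch, 3), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            tile = img[i*patch:(i+1)*patch, j*patch:(j+1)*patch, :]
            out[i*patch:(i+1)*patch, j*patch:(j+1)*patch, :] = \
                model.predict(tile[np.newaxis, ...])[0]
    return out
```

In practice, overlapping tiles with blended seams avoid visible block boundaries, but plain tiling is enough to demonstrate the idea.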
print("Low sample input")
Image(filename='./data/low/blenderman.png')
print("Ground truth")
Image(filename='./data/high/blenderman.png')